Introduction

Every year pedestrians continue to be involved in motor vehicle collisions. Thought the city continues to employ multiple methods to improve road safety for these users, the idea that inequality and safety are tied to the socioeconomic status of an area is prevalent. The purpose of this project is to study the relationship between a person’s socioeconomic status and their likelihood of being involved in a pedestrian – motor vehicle accident. This information can then be leveraged in determining locations where road safety improvements will have the biggest impact in lowering the socioeconomic inequality in the city of Toronto.

The purpose of this project is to determine if the socioeconomic status of a given area has an impact on the likelihood of pedestrians being hit by a vehicle.

Start by loading the various libraries and packages we will be using.

The “Pacman” package will automatically install any packages you do not currently have installed and load the applicable libraries.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, MASS, hablar, dplyr, spdplyr, sf, sp, spdep, spatialEco, rgdal, geojsonio, geojsonR, leaflet, leaflet.minicharts, leaflet.extras,leafsync, maptools, tmap, tidyverse, ggplot, gclus, rpart, arules, corrplot, corrgram, crs)

Create our Data Tables

KSI Pedestrian Data set

This data set is a subset of the Killed and Seriously Injured (KSI) data set collected by the Toronto Police Service from 2008-2018. These events include any serious or fatal collisions where a Pedestrian is involved. To learn more about Pedestrians related collisions in Toronto you can follow this link: http://data.torontopolice.on.ca/pages/pedestrians

We will use the FROM_GeoJson command from the “geojsonR” package to download a json file from the provided URL. The geojson_read command from the “geojsonio” and formatting form the “sp” package can then be used to read the json and generate a SpatialPointsDataFrame

ksi_ped <- geojsonio::geojson_read("https://opendata.arcgis.com/datasets/3dedc9bff625450990b8d480f397ad3f_0.geojson", what = "sp")
head(ksi_ped)

KSI TTC/Municipal Vehicle Data set

This data set is a subset of the Killed and Seriously Injured (KSI) data set collected by the Toronto Police Service from 2008-2018. These events include any serious or fatal collision involving an operator or passenger of a TTC, Transit Vehicle, streetcar or Municipal Vehicle. To learn more about TTC-Municipal Vehicle related collisions in Toronto you can follow this link: http://data.torontopolice.on.ca/pages/ttc-municipal-vehicle

We will use the FROM_GeoJson command from the “geojsonR” package to download a json file from the provided URL. The geojson_read command from the “geojsonio” and formatting form the “sp” package can then be used to read the json and generate a SpatialPointsDataFrame

ksi_ttc <- geojsonio::geojson_read("https://opendata.arcgis.com/datasets/dc4751278e604d65b0886b9765d4b551_0.geojson", what = "sp")
head(ksi_ttc)

Merge KSI_TTC and KSI_Ped

As you will have noticed from the descriptions both the KSI_PED and KSI_TTC data sets are themselves subsets of the Toronto Police Services’ KSI Data set. We downloaded them as separate data frames to enable faster downloads as they make up a small portion of the full KSI data set, which would take significantly more time to download and sort. Because they come from the same base data set and have the same schema we can merge them back into one data frame to work with. We will merge them based on the “Index_” to ensure there are no duplicates created due to the merging process.

ksi_merged <- merge(ksi_ped,ksi_ttc, by="Index_")

Neighbourhood Boundaries - City of Toronto

The City of Toronto also maintains a list that defines the boundaries of all the neighbourhoods in the city. A file containing the spatial data required to map these neighbourhoods can be downloaded form the City of Toronto open data portal. Due to how this file will download, unlike the KSI data, we cannot directly read the GeoJson file but have to download it before the file can be read.

Simply Analytics - Census Data

From Simply Analytics i have downloaded shape-files containing average income by census track and dissemination areas in the city. These two shape files will useful in drilling down beyond the neighbourhood. Both shape-files can be uploaded into a spatial data frame using the st_read command from the “sf” package we installed earlier then converting them to a SpatialPolygonsDataFrame.

ogrInfo(dsn="datasets/SimplyAnalytics_C1", layer="C1")
## Source: "C:\Users\gvs_j\Documents\GitHub\CSDA-1050F18S1\jacobgvs_304292\final\Datasets\SimplyAnalytics_C1", layer: "C1"
## Driver: ESRI Shapefile; number of rows: 572 
## Feature type: wkbPolygon with 2 dimensions
## Extent: (-79.6393 43.56034) - (-79.11347 43.85547)
## CRS: +proj=longlat +datum=WGS84 +no_defs  
## LDID: 87 
## Number of fields: 4 
##         name type length typeName
## 1 spatial_id    4     80   String
## 2       name    4     80   String
## 3     VALUE0    2     24     Real
## 4     VALUE1    2     24     Real
ctr <- readOGR(dsn = "datasets/SimplyAnalytics_C1", layer = "C1")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\gvs_j\Documents\GitHub\CSDA-1050F18S1\jacobgvs_304292\final\Datasets\SimplyAnalytics_C1", layer: "C1"
## with 572 features
## It has 4 fields
names(ctr@data)[names(ctr@data)=="VALUE0"] <- "Household Total Income After-Tax"
names(ctr@data)[names(ctr@data)=="VALUE1"] <- "Household Aggregate Income"
names(ctr@data)[names(ctr@data)=="VALUE2"] <- "Household Average Income"
head(ctr)
ogrInfo(dsn="datasets/SimplyAnalytics_C2", layer="C2")
## Source: "C:\Users\gvs_j\Documents\GitHub\CSDA-1050F18S1\jacobgvs_304292\final\Datasets\SimplyAnalytics_C2", layer: "C2"
## Driver: ESRI Shapefile; number of rows: 3702 
## Feature type: wkbPolygon with 2 dimensions
## Extent: (-79.6393 43.56034) - (-79.11347 43.85547)
## CRS: +proj=longlat +datum=WGS84 +no_defs  
## LDID: 87 
## Number of fields: 5 
##         name type length typeName
## 1 spatial_id    4     80   String
## 2       name    4     80   String
## 3     VALUE0    2     24     Real
## 4     VALUE1    2     24     Real
## 5     VALUE2    2     24     Real
disem <- readOGR(dsn = "datasets/SimplyAnalytics_C2", layer = "C2")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\gvs_j\Documents\GitHub\CSDA-1050F18S1\jacobgvs_304292\final\Datasets\SimplyAnalytics_C2", layer: "C2"
## with 3702 features
## It has 5 fields
names(disem@data)[names(disem@data)=="VALUE0"] <- "Household Total Income After-Tax"
names(disem@data)[names(disem@data)=="VALUE1"] <- "Household Aggregate Income"
names(disem@data)[names(disem@data)=="VALUE2"] <- "Household Average Income"
summary(disem)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##         min       max
## x -79.63930 -79.11347
## y  43.56034  43.85547
## Is projected: FALSE 
## proj4string :
## [+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0]
## Data attributes:
##     spatial_id                    name     
##  35200002:   1   DA0002, Toronto, ON:   1  
##  35200003:   1   DA0003, Toronto, ON:   1  
##  35200004:   1   DA0004, Toronto, ON:   1  
##  35200005:   1   DA0005, Toronto, ON:   1  
##  35200006:   1   DA0006, Toronto, ON:   1  
##  35200007:   1   DA0007, Toronto, ON:   1  
##  (Other) :3696   (Other)            :3696  
##  Household Total Income After-Tax Household Aggregate Income
##  Min.   :   0.0                   Min.   :        0         
##  1st Qu.: 159.0                   1st Qu.: 14441832         
##  Median : 202.5                   Median : 20666158         
##  Mean   : 300.6                   Mean   : 30880924         
##  3rd Qu.: 310.0                   3rd Qu.: 33046773         
##  Max.   :4889.0                   Max.   :534324916         
##                                                             
##  Household Average Income
##  Min.   :      0         
##  1st Qu.:  72481         
##  Median :  90304         
##  Mean   : 116745         
##  3rd Qu.: 120148         
##  Max.   :2340057         
## 
head(disem@data)

Reviewing the Data sets

With our data sets downloaded we can start by looking through the data. Because we are dealing almost exclusively with spatial data the easiest way to get a handle on the data is by mapping it. To do this I will be using features from the “leaflet” package.

KSI_merged Data set

Lets start by getting an idea of where accidents are occurring. The leaflet package will map the data for us. To make this more readable I have had the system auto cluster the accidents. These clusters don’t have any relation to specific neighbourhoods and will need to be adjusted later so that the clusters are in line with our other data sets.

leaflet(ksi_merged) %>%
  addTiles() %>%
  addMarkers(lng = ksi_merged$LONGITUDE, lat = ksi_merged$LATITUDE, clusterOptions = markerClusterOptions())

This map will allow you to zoom in and the clusters will auto adjust as you zoom in and out. These clusters are based on proximity to a central point. Once you get to the lowest zoom levels, clicking on a cluster will map the individual accidents. Details for each accident are not currently included in the mapping.

Neighbourhood Boundaries - City of Toronto

Our next data set contains the boundary lines for various neighbourhoods in Toronto. mapping this will begin to give some dimension to how we intend to cluster accidents going forward.

leaflet(nbh) %>%
  addTiles() %>%
  addPolygons()

Simply Analytics - Census Data

We can similarly plot both the census tract and dissemination area files pulled from Simply Analytics.

leaflet(ctr) %>%
  addTiles() %>%
  addPolygons()
leaflet(disem) %>%
  addTiles() %>%
  addPolygons()

Our next step will be to assign each accident in the KSI data to a dissemination area polygon. But before we can do that we will first have to clean up the various data sets.

Cleaning the Data

Before we begin we need to ensure that our various data sets are compatible.

Using the CRS function in the SF package We can check the coordinate reference system being used by our main files and can see that they are all the same.

st_crs(ksi_merged)
## Call:
## NULL
st_crs(ctr)
## Call:
## NULL
st_crs(disem)
## Call:
## NULL
st_crs(nbh)
## Call:
## NULL

We luckily all the files we have downloaded use the same co-ordinate reference system so we will be able to compare and associate them to one another without having to re-project them into a common co-ordinate system. This is especially convenient given that the mapping package w are using “leaflet” does not support “CRS” one of the other reference systems.

Cleaning the KSI data

As we review the data sets one thing you may have noticed is that the KSI data set contains a large number of potentially duplicate entries.

head(ksi_merged@data, 20)

While each accident is assigned a different index id number(Index_), in most cases multiple index numbers are assigned to the same account number(ACCNUM). This is because, for each accident, the driver of the vehicle, the pedestrian, and any other applicable individual related to the loss has their information recorded to the data set. This is most evident if you look at the involved vehicle type (INVTYPE) columns.

Since this review is focused on pedestrian accidents, we will filter our KSI_Merged file to include only instances where the column INVTYPE is a pedestrian.

We can easily identify all unique items in the INVTYPE column and select those that are representative of pedestrians.

unique(ksi_merged@data$INVTYPE)
##  [1] Driver               Pedestrian           Vehicle Owner       
##  [4] Passenger            Truck Driver         Other               
##  [7] Wheelchair           Pedestrian - Not Hit Motorcycle Driver   
## [10] Driver - Not Hit     Other Property Owner                     
## [13] In-Line Skater       Runaway - No Driver  Cyclist             
## [16] Witness              Trailer Owner       
## 17 Levels:   Cyclist Driver Driver - Not Hit ... Witness

If we look at our list of options, there are a few different entries we should be considering to be a pedestrian. For this review we will be filtering out all entries where the INVTYPE is “Pedestrian”, “Pedestrian - Not Hit”, “In-Line Skater”, or “Wheelchair”.

So as not to lose our merged data set, we will also create a new filtered data table. We can easily identify and, if required in the future, modify the list of items being filtered into our data table by setting them as a target

target_invtype <- c("Pedestrian", "Pedestrian - Not Hit", "In-Line Skater", "Wheelchair")
ksi_modified <- ksi_merged[ksi_merged@data$INVTYPE %in% target_invtype,]
head(ksi_modified@data, 20)

There are also a number of columns that will not be needed for our review and can be removed from our modified data set. These include: OFFSET, VEHTYPE, MANOEUVER, DRIVACT, DRIVCOND, CYCLISTYPE, CYCACT, CYCCOND, CYCLIST, TRSN_CITY_VEH, TRSN_CITY_, coords.x1, coords.x2, Ward_Name, Ward_ID, Hood_ID, Hood_Name, IMPACTYPE, INVTYPE, PEDESTRIAN, ACCLASS, Division, ObjectId

target_coll <- c("OFFSET", "VEHTYPE", "MANOEUVER", "DRIVACT", "DRIVCOND", "CYCLISTYPE", "CYCACT", "CYCCOND", "CYCLIST", "TRSN_CITY_VEH", "TRSN_CITY_", "coords.x1", "coords.x2", "Ward_Name", "Ward_ID", "Hood_ID", "Hood_Name", "IMPACTYPE", "INVTYPE", "PEDESTRIAN", "ACCLASS", "Division", "ObjectId")
ksi_modified <- ksi_modified[,-which(names(ksi_modified@data) %in% c(target_coll))]
head(ksi_modified@data, 20)

Cleaning the Toronto Neighbourhood data

Next we will look at the Toronto Neighbourhood data set.

head(nbh@data, 20)

We can see that for the most part all columns in this data set are useful. The columns that will not be of use and can be removed are: PARENT_AREA_ID - this field contains no unique values AREA_ATTR_ID - this is variable is identical to AREA_ID AREA_LONG_CODE - this is variable is identical to AREA_SHORT_CODE AREA_SHORT_CODE - this variable is contained withing AREA_NAME AREA_DESC - this is variable is identical to AREA_NAME X - this field contains no unique values Y - this field contains no unique values

unique(nbh@data$PARENT_AREA_ID)
## [1] 49885
unique(nbh@data$X)
## [1] <NA>
## Levels:
unique(nbh@data$Y)
## [1] <NA>
## Levels:
targetnbh <- c("PARENT_AREA_ID", "AREA_ATTR_ID", "AREA_SHORT_CODE" ,"AREA_LONG_CODE", "AREA_DESC", "X", "Y")
nbh_modified <- nbh
nbh_modified <- nbh_modified[,-which(names(nbh_modified@data) %in% c(targetnbh))]
head(nbh_modified@data, 20)

Cleaning the Census Tract and Dissemination data

Next we will look at the Toronto Neighbourhood data set.

head(ctr@data, 20)

We can see that these have a very minimal number of fields. We will however remove a few prior to joining our data sets just to simplify the final output.

ctr_modified <- ctr
ctr_modified <- ctr_modified[,-which(names(ctr_modified@data) %in% c("name", "Household Total Income After-Tax", "Household Aggregate Income"))]
head(ctr_modified@data, 20)

We will make the same modification to the dissemination area data.

disem_modified <- disem
disem_modified <- disem_modified[,-which(names(disem_modified@data) %in% c("name", "Household Total Income After-Tax", "Household Aggregate Income"))]
head(disem_modified@data, 20)

Associating the Data sets

We are interested in associating all our spatial polygon data frames to the KSI data points that represent each pedestrian related accident.

Our first task will be to join our KSI data set to a Toronto Neighbourhood.

This can be done using the spatialEco package. The spatialEco::point.in.poly function intersects point and polygon feature classes and adds polygon attributes to points. This function will re-name columns with similar names so we will also address re-naming some of the newly added columns.

ksi_coord <- spatialEco::point.in.poly(ksi_modified, nbh_modified)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
names(ksi_coord@data)[names(ksi_coord@data)=="LONGITUDE.x"] <- "LONGITUDE"
names(ksi_coord@data)[names(ksi_coord@data)=="LATITUDE.x"] <- "LATITUDE"
names(ksi_coord@data)[names(ksi_coord@data)=="LONGITUDE.y"] <- "LONGITUDE.nbh"
names(ksi_coord@data)[names(ksi_coord@data)=="LATITUDE.y"] <- "LATITUDE.nbh"
head(ksi_coord@data, 20)

Next we will add in the dissemination area information. Being the most granular this will allow us to accurately associate an average income to each accident.

ksi_coord <- spatialEco::point.in.poly(ksi_coord, disem_modified)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
head(ksi_coord@data, 20)

Data Exploration

Now that we have a assigned an average income to each accident we can start working to review our data and learn more about what story it can tell.

To allow us to more easily review the data set we will also convert it from a SpatialPointsDataFrame to a regular data frame. We can then pull some basic information about household income.

ksi_sf <- as.data.frame(ksi_coord@data)
colSums(is.na(ksi_sf))
##                   Index_                   ACCNUM                     YEAR 
##                        0                        0                        0 
##                     DATE                     TIME                     Hour 
##                        0                        0                        0 
##                  STREET1                  STREET2               ROAD_CLASS 
##                        0                        0                        0 
##                 District                 LATITUDE                LONGITUDE 
##                        0                        0                        0 
##                 LOCCOORD                   ACCLOC                 TRAFFCTL 
##                        0                        0                        0 
##               VISIBILITY                    LIGHT                 RDSFCOND 
##                        0                        0                        0 
##                   INVAGE                   INJURY                 FATAL_NO 
##                        0                        0                        0 
##                  INITDIR                  PEDTYPE                   PEDACT 
##                        0                        0                        0 
##                  PEDCOND               AUTOMOBILE               MOTORCYCLE 
##                        0                        0                        0 
##                    TRUCK                EMERG_VEH                PASSENGER 
##                        0                        0                        0 
##                 SPEEDING                  AG_DRIV                 REDLIGHT 
##                        0                        0                        0 
##                  ALCOHOL               DISABILITY                     X_id 
##                        0                        0                       11 
##                  AREA_ID                AREA_NAME            LONGITUDE.nbh 
##                       11                       11                       11 
##             LATITUDE.nbh                 OBJECTID              Shape__Area 
##                       11                       11                       11 
##            Shape__Length               spatial_id Household.Average.Income 
##                       11                       12                       12

This function produces a data frame with infinite values which will cause some issues. So first we will omit these rows from the data frame.

ksi_sf <- na.omit(ksi_sf)
colSums(is.na(ksi_sf))
##                   Index_                   ACCNUM                     YEAR 
##                        0                        0                        0 
##                     DATE                     TIME                     Hour 
##                        0                        0                        0 
##                  STREET1                  STREET2               ROAD_CLASS 
##                        0                        0                        0 
##                 District                 LATITUDE                LONGITUDE 
##                        0                        0                        0 
##                 LOCCOORD                   ACCLOC                 TRAFFCTL 
##                        0                        0                        0 
##               VISIBILITY                    LIGHT                 RDSFCOND 
##                        0                        0                        0 
##                   INVAGE                   INJURY                 FATAL_NO 
##                        0                        0                        0 
##                  INITDIR                  PEDTYPE                   PEDACT 
##                        0                        0                        0 
##                  PEDCOND               AUTOMOBILE               MOTORCYCLE 
##                        0                        0                        0 
##                    TRUCK                EMERG_VEH                PASSENGER 
##                        0                        0                        0 
##                 SPEEDING                  AG_DRIV                 REDLIGHT 
##                        0                        0                        0 
##                  ALCOHOL               DISABILITY                     X_id 
##                        0                        0                        0 
##                  AREA_ID                AREA_NAME            LONGITUDE.nbh 
##                        0                        0                        0 
##             LATITUDE.nbh                 OBJECTID              Shape__Area 
##                        0                        0                        0 
##            Shape__Length               spatial_id Household.Average.Income 
##                        0                        0                        0

There are also a number of instances where a household income of 0 was identified. This occurs due to the spatial data points not properly associating to a polygon. There are a total of 9 entries with 0$ incomes so we will subset our data set to remove these points prior to analysis.

ksi_sf <- subset(ksi_sf, Household.Average.Income > 0)
summary(ksi_sf)
##      Index_                ACCNUM          YEAR     
##  Min.   : 3738222   1366347   :   7   Min.   :2008  
##  1st Qu.: 6207626   1045692   :   4   1st Qu.:2010  
##  Median : 7815052   5000979465:   4   Median :2013  
##  Mean   :38362619   5001836357:   4   Mean   :2013  
##  3rd Qu.:80566635   5002211839:   4   3rd Qu.:2016  
##  Max.   :81074666   1021860   :   3   Max.   :2018  
##                     (Other)   :2070                 
##                      DATE           TIME           Hour     
##  2015/10/25 04:00:00+00:   8   1900   :  12   Min.   : 0.0  
##  2013/07/03 04:00:00+00:   7   2100   :  12   1st Qu.: 9.0  
##  2010/01/12 05:00:00+00:   6   1750   :  11   Median :14.0  
##  2015/12/26 05:00:00+00:   6   1830   :  11   Mean   :13.6  
##  2016/12/06 05:00:00+00:   6   2240   :  11   3rd Qu.:19.0  
##  2012/12/18 05:00:00+00:   5   1450   :  10   Max.   :23.0  
##  (Other)               :2058   (Other):2029                 
##            STREET1               STREET2              ROAD_CLASS  
##  YONGE ST      :  52                 : 195   Major Arterial:1456  
##  BATHURST ST   :  47   BATHURST ST   :  31   Minor Arterial: 319  
##  EGLINTON AVE E:  43   YONGE ST      :  21   Collector     : 134  
##  FINCH AVE W   :  39   EGLINTON AVE E:  20   Local         : 116  
##  DUNDAS ST W   :  38   BAY ST        :  18   Expressway    :  65  
##  LAWRENCE AVE E:  35   DUNDAS ST W   :  17   Laneway       :   2  
##  (Other)       :1842   (Other)       :1794   (Other)       :   4  
##                   District      LATITUDE       LONGITUDE     
##                       :  1   Min.   :43.59   Min.   :-79.62  
##  Etobicoke York       :385   1st Qu.:43.66   1st Qu.:-79.45  
##  North York           :406   Median :43.70   Median :-79.40  
##  Scarborough          :493   Mean   :43.71   Mean   :-79.39  
##  Toronto and East York:810   3rd Qu.:43.76   3rd Qu.:-79.32  
##  Toronto East York    :  1   Max.   :43.84   Max.   :-79.13  
##                                                              
##                                 LOCCOORD                      ACCLOC    
##                                     :  11   At Intersection      :1028  
##  Intersection                       :1477                        : 617  
##  Mid-Block                          : 606   Non Intersection     : 252  
##  Park, Private Property, Public Lane:   2   Intersection Related : 147  
##                                             At/Near Private Drive:  40  
##                                             Laneway              :   6  
##                                             (Other)              :   6  
##                  TRAFFCTL                    VISIBILITY  
##  No Control          :979   Clear                 :1711  
##  Traffic Signal      :912   Rain                  : 315  
##  Stop Sign           :142   Snow                  :  37  
##  Pedestrian Crossover: 40   Other                 :  16  
##  Traffic Controller  :  9   Fog, Mist, Smoke, Dust:   5  
##  Streetcar (Stop for):  6   Freezing Rain         :   5  
##  (Other)             :  8   (Other)               :   7  
##               LIGHT            RDSFCOND         INVAGE         INJURY    
##  Daylight        :1112   Dry       :1583   25 to 29: 176          :   0  
##  Dark, artificial: 441   Wet       : 455   20 to 24: 163   Fatal  : 349  
##  Dark            : 423   Other     :  22   50 to 54: 157   Major  :1630  
##  Dusk, artificial:  34   Loose Snow:  14   65 to 69: 141   Minimal:  32  
##  Dusk            :  33   Slush     :  13   55 to 59: 138   Minor  :  65  
##  Dawn, artificial:  22             :   4   30 to 34: 137   None   :  20  
##  (Other)         :  31   (Other)   :   5   (Other) :1184                 
##     FATAL_NO         INITDIR   
##  Min.   : 0.000          :149  
##  1st Qu.: 0.000   East   :402  
##  Median : 0.000   North  :489  
##  Mean   : 5.038   South  :376  
##  3rd Qu.: 0.000   Unknown:325  
##  Max.   :78.000   West   :355  
##                                
##                                                              PEDTYPE   
##  Pedestrian hit at mid-block                                     :513  
##  Vehicle turns left while ped crosses with ROW at inter.         :419  
##  Vehicle is going straight thru inter.while ped cross without ROW:355  
##  Pedestrian hit on sidewalk or shoulder                          :149  
##  Vehicle turns right while ped crosses with ROW at inter.        :119  
##  Vehicle is going straight thru inter.while ped cross with ROW   :117  
##  (Other)                                                         :424  
##                            PEDACT                              PEDCOND    
##  Crossing with right of way   :653   Normal                        :1123  
##  Crossing, no Traffic Control :475   Inattentive                   : 377  
##  Crossing without right of way:261   Unknown                       : 275  
##  On Sidewalk or Shoulder      :165   Had Been Drinking             : 155  
##  Other                        :164   Medical or Physical Disability:  51  
##  Running onto Roadway         :153   Other                         :  38  
##  (Other)                      :225   (Other)                       :  77  
##  AUTOMOBILE MOTORCYCLE TRUCK      EMERG_VEH  PASSENGER  SPEEDING  
##     : 276      :2073      :1996      :2094      :1883      :1975  
##  Yes:1820   Yes:  23   Yes: 100   Yes:   2   Yes: 213   Yes: 121  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##  AG_DRIV    REDLIGHT   ALCOHOL    DISABILITY      X_id      
##     :1212      :2025      :2059      :2086   Min.   :701.0  
##  Yes: 884   Yes:  71   Yes:  37   Yes:  10   1st Qu.:743.0  
##                                              Median :769.0  
##                                              Mean   :771.3  
##                                              3rd Qu.:802.0  
##                                              Max.   :840.0  
##                                                             
##     AREA_ID                                          AREA_NAME   
##  Min.   :25886328   Waterfront Communities-The Island (77):  89  
##  1st Qu.:25886494   Bay Street Corridor (76)              :  55  
##  Median :25886655   Kensington-Chinatown (78)             :  43  
##  Mean   :25886678   West Humber-Clairville (1)            :  43  
##  3rd Qu.:25886874   Wexford/Maryvale (119)                :  40  
##  Max.   :25886994   Church-Yonge Corridor (75)            :  38  
##                     (Other)                               :1788  
##  LONGITUDE.nbh     LATITUDE.nbh      OBJECTID         Shape__Area      
##  Min.   :-79.60   Min.   :43.59   Min.   :16491505   Min.   :  811304  
##  1st Qu.:-79.46   1st Qu.:43.66   1st Qu.:16492177   1st Qu.: 3743076  
##  Median :-79.39   Median :43.70   Median :16492593   Median : 8765411  
##  Mean   :-79.39   Mean   :43.71   Mean   :16492630   Mean   :11689079  
##  3rd Qu.:-79.32   3rd Qu.:43.76   3rd Qu.:16493121   3rd Qu.:14303504  
##  Max.   :-79.15   Max.   :43.82   Max.   :16493729   Max.   :72144024  
##                                                                        
##  Shape__Length      spatial_id   Household.Average.Income
##  Min.   : 3559   35200855:  25   Min.   :  22158         
##  1st Qu.: 9594   35204479:  18   1st Qu.:  65025         
##  Median :14113   35202098:  14   Median :  82856         
##  Mean   :16362   35200094:  11   Mean   : 100459         
##  3rd Qu.:19657   35201890:  10   3rd Qu.: 108063         
##  Max.   :59561   35204900:  10   Max.   :1065780         
##                  (Other) :2008

With those removed, lets begin by reviewing some of our data variables to see how informative they will be.

Merging and Transforming Variables

If we look at the summary above we can see a series of columns with a Yes and a blank. (AUTOMOBILE, MOTORCYCLE, TRUCK, EMERG_VEH, PASSENGER, SPEEDING, AG_DRIV, REDLIGHT, ALCOHOL, DISABILITY) The first 4 represent the type of vehicle involved in the accident. These can be combined into one column for easy review.

ksi_sf$VEH <- if_else_(ksi_sf$AUTOMOBILE == "Yes", "Automobile", 
                       if_else_(ksi_sf$MOTORCYCLE == "Yes","Motorcycle", 
                                if_else_(ksi_sf$TRUCK == "Yes","Truck", 
                                         if_else_(ksi_sf$EMERG_VEH == "Yes","Emergency Veh","NA"))))

ggplot(ksi_sf, aes(x = VEH, y = ..prop.., group = 1))+
  geom_bar(fill = "#0073C2FF")+
  ggtitle("Accidents by Vehicle Type")

We can see that almost 90% of all instances involved an automobile rendering this string of variables statistically unimportant and giving a good indication that they can be removed from our data set. We will however keep the combined column that we created to review them.

ksi_sf <- ksi_sf[,-which(names(ksi_sf) %in% c("AUTOMOBILE", "MOTORCYCLE", "TRUCK", "EMERG_VEH"))]

If we look at the other variables, they are all characteristics of the driver who was involved in the accident. It is a well established fact that factors like speeding, aggressive driving, running red lights, and driving while under the influence of alcohol are all going to make you more likely to be involved in an accident. However, this project is focused on the pedestrian, the surrounding community and controllable factors related to where the accident occurred that could be resulting in more pedestrian accidents. As such, we will not be focusing on any of these variables; we will however address the blank values.

ksi_sf$PASSENGER <- ifelse(ksi_sf$PASSENGER == "Yes", "Yes", "No")
ksi_sf$SPEEDING <- ifelse(ksi_sf$SPEEDING == "Yes", "Yes", "No")
ksi_sf$AG_DRIV <- ifelse(ksi_sf$AG_DRIV == "Yes", "Yes", "No")
ksi_sf$REDLIGHT <- ifelse(ksi_sf$REDLIGHT == "Yes", "Yes", "No")
ksi_sf$ALCOHOL <- ifelse(ksi_sf$ALCOHOL == "Yes", "Yes", "No")
ksi_sf$DISABILITY <- ifelse(ksi_sf$DISABILITY == "Yes", "Yes", "No")

We will also merge most of these values into one column so if needed they can be graphed against our other variables.

ksi_sf$VEHFACTORS <- if_else_(ksi_sf$SPEEDING == "Yes", "Speeding", 
                       if_else_(ksi_sf$AG_DRIV == "Yes","Aggressive Driving", 
                                if_else_(ksi_sf$REDLIGHT == "Yes","Ran Red Light", 
                                         if_else_(ksi_sf$ALCOHOL == "Yes","Alcohol",
                                                  if_else_(ksi_sf$DISABILITY == "Yes","Disability", "NA")))))

ksi_sf$PASSENGER <- ifelse(ksi_sf$PASSENGER == "Yes", 1, 0)
ksi_sf$SPEEDING <- ifelse(ksi_sf$SPEEDING == "Yes", 1, 0)
ksi_sf$AG_DRIV <- ifelse(ksi_sf$AG_DRIV == "Yes", 1, 0)
ksi_sf$REDLIGHT <- ifelse(ksi_sf$REDLIGHT == "Yes", 1, 0)
ksi_sf$ALCOHOL <- ifelse(ksi_sf$ALCOHOL == "Yes", 1, 0)
ksi_sf$DISABILITY <- ifelse(ksi_sf$DISABILITY == "Yes", 1, 0)

Graphing Variables

Next we will review the other variables in our data set.

The easiest way to gauge impact will be to graphically represent each variable based on frequency. WE will run through each variable in order commenting on what each is telling us about the data.

Frequency by Year

ggplot(data = ksi_sf)+
  geom_bar(mapping = aes(YEAR), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Accidents by Year")

We can clearly see that the total number of accidents per year does not usually fluctuate all that much. Typically frequency will fluctuate in the 175 to 200 item range only dropping below 175 three times and exceeding 200 four times over the 11 years of data.

Frequency by Road Class

ggplot(data = ksi_sf)+
  geom_bar(mapping = aes(ROAD_CLASS), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Accidents by Class of Road")

Virtually every pedestrian accident occurs on what is considered a Major Arterial road.

Frequency by District

ggplot(data = ksi_sf)+
  geom_bar(mapping = aes(District), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Accidents by District")

This gives us a good idea of which general area is seeing the most pedestrian accidents. It is clear that the largest number of accidents are occurring in Toronto and East York which is expected given this district contains the downtown core.

Frequency by Driver Influence

ggplot(data = ksi_sf)+
  geom_bar(mapping = aes(VEHFACTORS), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Driver Influenced by")+
  ggtitle("Accidents by Driver Influence")

We can see that the when an accident occurs, a large number of those accidents are resulting as a result of Aggressive Driving

Frequency by Accident Location

ggplot(data = ksi_sf)+
  geom_bar(mapping = aes(ACCLOC), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Accidents by Accident Location")

We can see that a large number of accidents are occurring at intersections. Unfortunately this field also incomplete with a large number of entries without.

Frequency by Traffic Control Measures

ggplot(data = ksi_sf)+
  geom_bar(mapping = aes(TRAFFCTL), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Accidents by Traffic Control Measures")

We can see that most accidents occur in areas with no traffic control measures or at traffic lights.

Frequency by Visibility

ggplot(data = ksi_sf) +
  geom_bar(mapping = aes(VISIBILITY), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Accidents by Visibility")

Most accidents are also occurring when there are no factors impacting visibility

Frequency by Type of Light

ggplot(data = ksi_sf) +
  geom_bar(mapping = aes(LIGHT), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Type of Lighting")+
  ggtitle("Accidents by Type of Lighting")

Most accidents are also occurring in daylight with minimal numbers of accidents occurring during the other times of day when you would expect visibility issues to be of a bigger impact.

Frequency by Road Condition

ggplot(data = ksi_sf) +
  geom_bar(mapping = aes(RDSFCOND), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Road Condition")+
  ggtitle("Accidents by Road Condition")

These accidents are also occurring primarily in dry condition as opposed to conditions that would be more likely to reduce visibility or the drivers ability to handle the vehicle.

Frequency by Vehicle Action

ggplot(data = ksi_sf) +
  geom_bar(mapping = aes(PEDTYPE), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Pedestrian Action")+
  ggtitle("Accidents by Pedestrian Type")

This is one of the most interesting variable we have available and the least intuitive. Of all pedestrian accidents the most frequent involves pedestrians crossing mid-block. The next two most frequent are both crossings at intersections where we can see that pedestrians are actually more likely to be hit while crossing with the right of way as opposed to crossing without the right of way.

Frequency by Pedestrian Action

ggplot(data = ksi_sf) +
  geom_bar(mapping = aes(PEDACT), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Pedestrian Action")+
  ggtitle("Accidents by Pedestrian Action")

Surprisingly the pedestrian crossing mid-block form the graph above have disappeared.

Frequency by Pedestrian Condition

ggplot(data = ksi_sf) +
  geom_bar(mapping = aes(PEDCOND), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Pedestrian Condition")+
  ggtitle("Accidents by Pedestrian Condition")

Finally, we can see that pedestrian condition (aside from the pedestrian being inattentive) appears to have little impact on Pedestrian accidents.

Average Household Income

Since we are studying whether socioeconomic factors impact the likelihood of an accident we also need to review the average household income.

We can start by creating a histogram of the Household Average Income just to get an idea of the income spread we are working with.

qplot(ksi_sf$Household.Average.Income,
      geom="histogram",
      binwidth = 5,
      main = "Histogram Average Income",
      xlab = "Average Income",
      fill=I("blue"), 
      col=I("red"))

As we can see the majority of losses are clustered at the bottom end of the income scale between $0 and $150k. However this histogram can be deceiving. The average income varies so greatly between the lowest and highest values that it is hard to identify what income brackets this cluster actually represents.

We can re-present the same graph with an increased bin width to help make the data more meaningful.

qplot(ksi_sf$Household.Average.Income,
      geom="histogram",
      main = "Histogram Average Income",
      xlab = "Average Income",
      fill=I("blue"), 
      col=I("red"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Ultimately there is too much variety in average household income. To get a better and more meaningful graph of income we would want to convert income to specified ranges.

We can pull some basic variables to determine what those ranges should be.

mean(ksi_sf$Household.Average.Income)
## [1] 100458.5
mode(ksi_sf$Household.Average.Income)
## [1] "numeric"
range(ksi_sf$Household.Average.Income)
## [1]   22158 1065780
quantile(ksi_sf$Household.Average.Income)
##      0%     25%     50%     75%    100% 
##   22158   65025   82856  108063 1065780
quantile(ksi_sf$Household.Average.Income, probs = c(0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.2, 0.3, 0.4, 0.5, 0.75, 1.00))
##         0%         1%         2%         3%         4%         5% 
##   22158.00   32346.80   36271.00   41648.00   43476.00   44432.00 
##         6%         7%         8%         9%        10%        20% 
##   46357.20   47771.75   49463.00   51250.00   51986.00   61510.00 
##        30%        40%        50%        75%       100% 
##   68358.00   76060.00   82856.00  108063.00 1065780.00

The quantile results are very interesting in that we can see that the bottom 25% of incomes actually varies between $0 and $65,000 and the highest value income being $10 million dollars higher.

This information combined with the histograms above give some indication that we will probably want to focus on the lower incomes between 30k and 150k.

We can use the following code to

ksi_sf$Average_Income3 <- if_else(ksi_sf$Household.Average.Income <= 30000, "30", "0")
ksi_sf$Average_Income4 <- if_else(ksi_sf$Household.Average.Income <= 40000, "40", "0")
ksi_sf$Average_Income5 <- if_else(ksi_sf$Household.Average.Income <= 50000, "50", "0")
ksi_sf$Average_Income6 <- if_else(ksi_sf$Household.Average.Income <= 60000, "60", "0")
ksi_sf$Average_Income7 <- if_else(ksi_sf$Household.Average.Income <= 70000, "70", "0")
ksi_sf$Average_Income8 <- if_else(ksi_sf$Household.Average.Income <= 80000, "80", "0")
ksi_sf$Average_Income9 <- if_else(ksi_sf$Household.Average.Income <= 90000, "90", "0")
ksi_sf$Average_Income10 <- if_else(ksi_sf$Household.Average.Income <= 100000, "100", "0")
ksi_sf$Average_Income11 <- if_else(ksi_sf$Household.Average.Income <= 110000, "110", "0")
ksi_sf$Average_Income12 <- if_else(ksi_sf$Household.Average.Income <= 120000, "120", "0")
ksi_sf$Average_Income13 <- if_else(ksi_sf$Household.Average.Income <= 130000, "130", "0")
ksi_sf$Average_Income14 <- if_else(ksi_sf$Household.Average.Income <= 140000, "140", "0")
ksi_sf$Average_Income15 <- if_else(ksi_sf$Household.Average.Income <= 150000, "150", "0")
ksi_sf$Average_Income16 <- if_else(ksi_sf$Household.Average.Income > 150000, "150+", "0")

ksi_sf$Average_Income <- if_else(ksi_sf$Average_Income3 > 0, ksi_sf$Average_Income3, if_else(ksi_sf$Average_Income4 > 0, ksi_sf$Average_Income4,
if_else(ksi_sf$Average_Income5 > 0, ksi_sf$Average_Income5, 
if_else(ksi_sf$Average_Income6 > 0, ksi_sf$Average_Income6, 
if_else(ksi_sf$Average_Income7 > 0, ksi_sf$Average_Income7, 
if_else(ksi_sf$Average_Income8 > 0, ksi_sf$Average_Income8, 
if_else(ksi_sf$Average_Income9 > 0, ksi_sf$Average_Income9, 
if_else(ksi_sf$Average_Income10 > 0, ksi_sf$Average_Income10, 
if_else(ksi_sf$Average_Income11 > 0, ksi_sf$Average_Income11,
if_else(ksi_sf$Average_Income12 > 0, ksi_sf$Average_Income12,
if_else(ksi_sf$Average_Income13 > 0, ksi_sf$Average_Income13,
if_else(ksi_sf$Average_Income14 > 0, ksi_sf$Average_Income14,
if_else(ksi_sf$Average_Income15 > 0, ksi_sf$Average_Income15,
if_else(ksi_sf$Average_Income16 > 0, ksi_sf$Average_Income16,"NA"))))))))))))))

ksi_sf <- ksi_sf[,-which(names(ksi_sf) %in% c("Average_Income3", "Average_Income4", "Average_Income5", "Average_Income6", "Average_Income7", "Average_Income8", "Average_Income9", "Average_Income10", "Average_Income11", "Average_Income12", "Average_Income13", "Average_Income14", "Average_Income15", "Average_Income16"))]

ggplot(data = ksi_sf) +
  geom_bar(mapping = aes(Average_Income), fill = "#0073C2FF")+
  ggtitle("Accidents by Average Household Income in thousands")

We can clearly see that the highest volume of pedestrian accidents are occurring in neighbourhoods where the average household income is between $60K and $90k. Assuming the average household income is attributable to the income of two earners that would equate to an average salary per person of $30k to $45k.

Comparing Variables

Having reviewed the variables at play in our data set we should take a moment to compare some of the more important variables to see if there is any correlation between factors

Comparing Variable Frequency by Year

We will quickly graph a few variable against the Year variable to see if the there are any obvious frequency changes year over year.

ggplot(data = ksi_sf) +
  geom_bar(mapping = aes(Average_Income),fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Accidents per Year by Average Income")+
  facet_wrap(~ YEAR)

ggplot(data = ksi_sf) +
  geom_bar(mapping = aes(x = YEAR, fill = District))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Accidents per District by Year")

ggplot(data = ksi_sf) +
  geom_bar(mapping = aes(x = YEAR, fill = PEDACT))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  ggtitle("Accidents per Pedestrian Action by Year")

Comparing Multiple Variables

A few of the variables reviewed above really stand out.

We saw based on the accident location that the majority of accidents occurred either at an intersection or at an undefined location. We also saw in looking at the vehicle action variable that vehicles most frequently hit pedestrians crossing mid-block, or crossing at an intersection while the vehicle turned right or left.

Accidents at intersections

If we focus on the accidents if would be interesting to know what sort of traffic controls were in place at the various locations where accidents occurred.

ggplot(data = ksi_sf) +
  geom_bar(mapping = aes(x = ACCLOC, fill = TRAFFCTL))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Accident Location")+
  ggtitle("Accidents by Location and Traffic Control")

We can see that when an accident occurred at an intersection, the intersection was typically controlled by traffic signal.

We would naturally assume that traffic signals, baring other driver related factors, would make intersections safer for pedestrians provided of course that the pedestrian is following their traffic signals. So is we want to determine if there is a driver related factor that could be negatively impacting the frequency of pedestrian accidents at intersections. Similarly, a majority of those same accidents involved drivers that were deemed to be driving excessively.

ggplot(filter(ksi_sf, ACCLOC ==c("At Intersection", "Intersection Related"))) +
  geom_bar(mapping = aes(VEHFACTORS), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Driver Influence")+
  ggtitle("Accidents at intersections by Driver Influence")

With slightly over half the accidents occurring at an intersection involving aggressive drivers, the question should be asked if the pedestrian is acting in a way that is increasing their likely hood of being hit. This can be answered by factoring in the Pedestrian Action variable.

ggplot(filter(ksi_sf, ACCLOC ==c("At Intersection", "Intersection Related"))) +
  geom_bar(mapping = aes(x = VEHFACTORS, fill = PEDACT))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Driver Influence")+
  ggtitle("Accidents at Intersections by Driver Influence and Pedestrian Action")

This tells us that when aggressive driving is a factor, pedestrian accidents at an intersection appear to be the result of driver error given that the vast majority of cases involve pedestrians crossing while they have the right of way.

By contrast, pedestrian accidents at an intersection that don’t involve aggressive driving seem to involve the pedestrian crossing without the right of way or at an intersection with no traffic control.

So if in the majority of accident involving pedestrians at intersections the pedestrian is crossing the street with the right of way, what is the driver doing that is causing an accident.

ggplot(filter(ksi_sf, ACCLOC ==c("At Intersection", "Intersection Related"))) +
  geom_bar(mapping = aes(x = VEHFACTORS, fill = PEDTYPE))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Driver Influence")+
  ggtitle("Accidents by Driver Influence by Vehicle Action")

This further clarifies what we identified in the graphs above. When aggressive driving is involved, Pedestrian accidents are occurring at intersections when the driver is turning left or right and much less frequently when they are driving straight through the intersection.

Accident Location not Identified

The other large grouping of accidents were those that occurred at locations that could not be identified. If we quickly run through the same exercise as above we will see: - A large majority of accidents do not involve any specified driver influence - 3 types of action seem to dominate running out into the roadway, getting on/off a school bus, and crossing where no traffic control is in place. - This correlates to a pedestrian most frequently being struck by a vehicle mid-block or while attempting to cross the roadway without the right of way.

ggplot(filter(ksi_sf, ACCLOC == " ")) +
  geom_bar(mapping = aes(VEHFACTORS), fill = "#0073C2FF")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Driver Influence")+
  ggtitle("Accidents at Undefined Locations by Driver Influence")

ggplot(filter(ksi_sf, ACCLOC == " ")) +
  geom_bar(mapping = aes(x = VEHFACTORS, fill = PEDACT))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Driver Influence")+
  ggtitle("Accidents at Undefined Locations by Driver Influence and Pedestrian Action")

ggplot(filter(ksi_sf, ACCLOC == " ")) +
  geom_bar(mapping = aes(x = VEHFACTORS, fill = PEDTYPE))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Driver Influence")+
  ggtitle("Accidents at Undefined Locations by Driver Influence and Vehicle Action")

Factoring for Income

We have already been able to identify the non economic factors that are influencing the likelihood of a loss. Now we need to explore how these factors are distributed by Average Income.

We have already observed that the the frequency and cause of a pedestrian accident correlates directly to how the driver is driving their vehicle and where the accident is occurring.

If we compare these against average income we can clearly see that as a pedestrian you are most at risk of being involved in an accident if you are in an area with average household income between 60k and 90k (30K to 45K per income earner)

We have also already observed that accident location can really be grouped into two groups. Accidents occurring at an intersection and accidents occurring at other points on the street such as mid-block. Graphing accident location details by income reinforces this grouping. As such going forward we will be filtering the data set as two groups: - At Intersection or Intersection related - No location identified or Non intersection

All other location variables will not be used.

ggplot(ksi_sf) +
  geom_bar(mapping = aes(x = Average_Income, fill = VEHFACTORS))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  facet_wrap(~ ACCLOC, scales = "free")+
  ggtitle(" Accidents grouped by Accident Location \n graphed by Average Income and Driver Influence")

Accidents occurring at an Intersection

We have already observed that the the frequency and cause of a pedestrian accident correlates directly to how the driver is driving their vehicle.

If we return to our Accidents at intersections by Driver Influence graph and overlay income data we will see that when a driver is driving their vehicle aggressively about 63% of accidents occur in low income areas. When the driver is not driving aggressively, about 60% of accidents occur in low income areas.

ggplot(filter(ksi_sf, ACCLOC == c("At Intersection", "Intersection Related"))) +
  geom_bar(mapping = aes(x = VEHFACTORS, fill = Average_Income))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Driver Influence")+
  ggtitle("Accidents at Intersections grouped by Driver influence and Average Income")

It is actually more informative if we re-scale and look at each driver influences separately. We will run two sets of graphs. The first will graph what action the pedestrian was taking when struck by a vehicle based on the factors that were influencing the driver where frequency is associated to average income. The first will graph what action the driver was taking when they struck a pedestrian based on the factors that were influencing the driver where frequency is associated to average income.

ggplot(filter(ksi_sf, ACCLOC == c("At Intersection", "Intersection Related"))) +
  geom_bar(aes(Average_Income, fill = PEDACT))+
  facet_wrap(~ VEHFACTORS, scales = "free")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(filter(ksi_sf, ACCLOC == c("At Intersection", "Intersection Related"))) +
  geom_bar(aes(Average_Income, fill = PEDTYPE))+
  facet_wrap(~ VEHFACTORS, scales = "free")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

From these graphs we can see that as a pedestrian you are most at risk of being involved in an accident if you are in an area with average household income between 60k and 90k (30K to 45K per income earner) However the likely cause of the accident changes based on how the driver of the vehicle is driving.

Aggressive Driving: - If the driver is driving aggressively, pedestrians are most likely to be hit: - while crossing at an intersection when they have the right of way because - the driver is turning instead of proceeding through the intersection.

No Driver Influence: - If the driver is under no influence, pedestrians are most likely to be hit: - while crossing at intersections where there is no traffic control or - while crossing at intersections when they do not have the right of way and - the driver is proceeding through the intersection

Speeding: - If the driver is speeding, pedestrians are most likely to be hit: - while walking on the sidewalk or shoulder of the street and is struck or - while crossing an intersection with or without the right of way and - the driver is proceeding through the intersection.

Accidents Not occurring at an intersection.

If we run the same graphs but filter our data to focus on accidents that did not occur at an intersection, the distribution of accidents by average income level doesn’t appear to have change that drastically. - Pedestrians are still most likely to be involved in an accident in an area with average household income between 60k and 90k (30K to 45K per income earner) - And our findings from above based on the primary three driver influences remain largely unchanged. The only major difference is that once the accidents are removed from the intersection, the impact speeding and aggressive driving have on overall accident frequency is significantly diminished.

ggplot(filter(ksi_sf, ACCLOC == c(" ","Non Intersection"))) +
  geom_bar(mapping = aes(x = VEHFACTORS, fill = Average_Income))+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))+
  xlab("Driver Influence")+
  ggtitle("Accidents at Intersections grouped by Driver influence and Average Income")

ggplot(filter(ksi_sf, ACCLOC == c(" ","Non Intersection"))) +
  geom_bar(aes(Average_Income, fill = PEDACT))+
  facet_wrap(~ VEHFACTORS, scales = "free")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(filter(ksi_sf, ACCLOC == c(" ","Non Intersection"))) +
  geom_bar(aes(Average_Income, fill = PEDTYPE))+
  facet_wrap(~ VEHFACTORS, scales = "free")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Traffic Control Measures

As we look at all our income related graphs we notice the same pattern there are decidedly fewer pedestrians in higher income areas (areas 110K and up). The question becomes do we have enough data to determine if there is a reason for the drop.

ggplot(ksi_sf) +
  geom_bar(aes(Average_Income, fill = TRAFFCTL))+
  facet_wrap(~ TRAFFCTL, scales = "free")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(filter(ksi_sf, Average_Income == c("100", "110", "120","130", "140", "150", "150+"))) +
  geom_bar(aes(Average_Income, fill = TRAFFCTL))+
  facet_wrap(~ District, scales = "free")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## Warning in Average_Income == c("100", "110", "120", "130", "140", "150", :
## longer object length is not a multiple of shorter object length

Conclusion

Our intention with this review was to determine if the socioeconomic status of a given area affects the likelihood of a pedestrian being hit in that area. We have seen that total number of accidents per year does not fluctuate widely. We have also seen that accidents occur in the same district at approximately the same proportion every year. And we have seen that no single factor or combination of factors seems to disrupt the distribution of accidents.

Ultimately, we have clearly demonstrated that if you are walking in a mid to low income neighbourhood you are at a higher risk of being hit by a vehicle than if you are walking in a higher income neighbourhood. However, there insufficient evidence in the data available to definitely identify average household income as the reason for this increased in risk.

We can however say that in Etobicoke, North York, and Scarborough there is a very evident issue in these low to middle income neighbourhoods of aggressive driving resulting in pedestrian accidents. Since we have also clearly linked these aggressive driving accidents to right and left hand turns, it may be wise for the city to investigate measures that could be put into place to prevent these issues such as dedicated periods for all lane pedestrian crossings, limiting left turns to designated turning windows, or staggering the pedestrian and vehicular traffic signals similar to jurisdictions such as Montreal to ensure that the pedestrian will be half way through the intersection before drivers receive the right to proceed.

ggplot(filter(ksi_sf, District == c("Etobicoke York", "North York", "Scarborough", "Toronto and East York"))) +
  geom_bar(aes(Average_Income, fill = VEHFACTORS))+
  facet_wrap(~ District, scales = "free")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

fluctuate in the 175 to 200 item range only dropping below 175 three times and exceeding 200 four times over the 11 years of data.tends to be relatively We have been able to determine that if you are in an area with average household income between 60k and 90k (30K to 45K per income earner) However the likely cause of the accident changes based on how the driver of the vehicle is driving.

As per Statistics Canada the Median After-Tax Household Income in 2017 for the city of Toronto was $69,100. (Persons not in an economic family coming in at just below 32,400, and economic families coming in at 88,600k) Statistics Canada. Table 11-10-0190-01 Market income, government transfers, total income, income tax and after-tax income by economic family type

Using the Low-Income Measure (LIM) which defines the poverty rate as 50 per cent of the median household income that would set the LIM at 39K

ggplot(filter(ksi_sf, ACCLOC == c(" ","Non Intersection"))) +
  geom_bar(aes(Average_Income, fill = PEDACT))+
  facet_wrap(~ VEHFACTORS, scales = "free")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

On final graph of accidents in Toronto. Hover over each accident to view details on the map about the accident in question.

Lcontent <- paste("Year:", ksi_coord@data$YEAR,
                  "Income:",ksi_coord@data$Household.Average.Income, 
                  "Accident Location:",
                  ksi_coord@data$ACCLOC,
                  "Vehicle Action:", ksi_coord@data$PEDTYPE)

leaflet(ksi_coord) %>%
  addTiles() %>%
  addCircleMarkers(
    stroke = FALSE, fillOpacity = 0.5,weight = 1,
    label = ~as.character(Lcontent),
    clusterOptions = markerClusterOptions()) %>%
  addPolygons(data = nbh_modified,
              fillColor = "transparent", 
              color = "#000000",
              fillOpacity = 0.8,
              group = "Neighbourhood", 
              weight = 2) %>%
  addPolygons(data = ctr,
              fillColor = "transparent", 
              color = "#000000",
              fillOpacity = 0.8,
              group = "Census Tract", 
              weight = .8) %>%
  addPolygons(data = disem,
              fillColor = "transparent", 
              color = "#000000",
              fillOpacity = 0.8,
              group = "Dissemination Area", 
              weight = .4) %>%
  addLayersControl(overlayGroups =c("Neighbourhood", "Census Tract", "Dissemination Area"),
                   options = layersControlOptions(collapsed=FALSE))